Abstract

Training structured prediction models is time-consuming. However, most existing
approaches use only a single machine, so the computing power and the larger data
capacity of multiple machines have not been exploited. In this work, we propose an
efficient algorithm for training structured support vector machines in a distributed
setting, based on a distributed block-coordinate descent method. Both theoretical and
experimental results indicate that our method is efficient.


1 Introduction 

Many tasks in natural language processing and computer vision can be formulated as structured
prediction problems, where the goal is to assign values to mutually dependent variables. The
interdependencies constitute the “structure”. To fully exploit the rich representation of the
structures, it is essential to use large amounts of data. However, in practice, only a limited
amount of data can be used to train a structured model because most current approaches for
structured learning are confined to a single machine, which imposes a limit on memory and disk
capacity. For linear classification, this problem has been addressed by distributed training
algorithms (see, e.g., [?]). However, there is little work on developing distributed algorithms
for general structured learning.

Moreover, most existing distributed training algorithms for linear classification rely on certain
properties of the objective function (e.g., differentiability). However, directly applying these
methods to structured learning results in inferior convergence rates. For example, dissolve-struct¹
uses the framework in [?] for structured SVM, but this leads to a convergence rate that is only
sublinear.

There are several challenges in distributed structured learning. First, the feature vectors, which
are extracted from both the input and the output structures, are often generated on-the-fly during
the training process. Synchronizing their indices across different machines may introduce additional
overhead. Second, the training time of a learning algorithm consists of three parts: 1) communication,
2) inference, and 3) learning. It is important to balance these three factors. This is in contrast
to linear classification, where communication is often the only bottleneck.

In this work, we address these challenges and extend the recently proposed distributed
box-constrained quadratic optimization algorithm (BQO) [?] for structured support vector machines
(SSVM) [?]. We show that the global linear convergence rate $O(\log(1/\epsilon))$ can be obtained,
even though the objective function of SSVM is non-smooth. This result is substantial, because
reducing the number of outer iterations saves the time taken to solve the costly sub-problems.
Moreover, the per-machine local sub-problems in BQO can be formed as small SSVM problems, which can
be efficiently solved by off-the-shelf solvers. This enables us to leverage well-studied
single-machine structured learning methods such as the dual coordinate descent algorithm [?].
Experiments show that our algorithm is efficient and is therefore suitable for training large-scale
structured models.

*Most of this work was done when the authors were at the University of Illinois.

1 http://dalab.github.io/dissolve-struct/

Existing Works. A distributed structured Perceptron algorithm using the map-reduce framework
is proposed in [?]. A structured Perceptron algorithm with mini-batch updates is discussed in [?].
However, it is unclear how to extend their algorithm from a multi-core machine to a distributed
setting. When the inference problem is formulated as a factor graph, [?] proposed to split the
graph-based optimization problem into sub-problems, where each sub-problem deals with a sub-graph.
Each machine then solves a sub-problem in parallel, and the machines communicate with each other
to enforce consistency. The convergence rate of this type of approach is unclear. Moreover, our
approach distributes instances instead of sub-graphs and is more suitable for problems with
unfactorable structures and/or many instances (e.g., parsing, sequence tagging, and alignment).
A simple distributed implementation of the cutting plane method² is also available. It solves the
inference problems in parallel and uses one machine to learn the model. This type of approach
requires many outer iterations and is empirically slow even in a single-machine multi-core
setting (see [?]).

2 Structured Support Vector Machine 

Given a set of observations $\{(x_i, y_i)\}_{i=1}^{l}$, where $x_i \in \mathcal{X}$ are instances with
the corresponding annotated structures $y_i \in \mathcal{Y}_i$, and $\mathcal{Y}_i$ is the set of all
feasible structures for $x_i$, SSVM solves

$$\min_{w, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \ell(\xi_i) \quad \text{s.t.} \quad w^T \phi(y, y_i, x_i) \ge \Delta(y_i, y) - \xi_i, \ \forall y \in \mathcal{Y}_i, \ i = 1, \dots, l, \tag{1}$$

where $C > 0$ is a predefined parameter, $\phi(y, y_i, x_i) = \Phi(x_i, y_i) - \Phi(x_i, y)$, and
$\Phi(x, y)$ is the generated feature vector depending on both the input $x$ and the structure $y$.
$\ell(\xi)$ is the loss term to be minimized, and the loss function $\Delta(y, y_i) \ge 0$ is a metric
that represents the distance between structures. In this paper, we consider the L2-loss,
$\ell(\xi) = \xi^2$.³

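We consider solving eq. (1) in its dual form. Let $\alpha$ be the vector of dual variables with
dimension $\sum_{i=1}^{l} |\mathcal{Y}_i|$, $\otimes$ be the Kronecker product, and $e$ be the vector
of ones. The dual of (1) can be written as

$$\begin{aligned}
\min_{\alpha \ge 0} \ f(\alpha) &= \frac{1}{2} \alpha^T \left(Q + \frac{A}{2C}\right) \alpha - v^T \alpha, \\
Q_{(i, y_1), (j, y_2)} &= \phi(y_1, y_i, x_i)^T \phi(y_2, y_j, x_j), \quad \forall\, 1 \le i, j \le l, \ \forall y_1 \in \mathcal{Y}_i, \ \forall y_2 \in \mathcal{Y}_j, \\
A &= (I \otimes e)^T (I \otimes e), \quad v_{(i, y)} = \Delta(y_i, y), \quad \forall\, 1 \le i \le l, \ \forall y \in \mathcal{Y}_i.
\end{aligned} \tag{2}$$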

From the KKT conditions, the respective optimal solutions $w^*$ and $\alpha^*$ to eq. (1) and eq. (2)
satisfy $w^* = \sum_{i=1}^{l} \sum_{y \in \mathcal{Y}_i} \alpha^*_{i,y}\, \phi(y, y_i, x_i)$. For ease of
computation, we maintain this relationship between $w$ and $\alpha$ during the optimization process,
and treat $w$ as a temporary vector.
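The relationship between $w$ and $\alpha$ also means that the quadratic term $\alpha^T Q \alpha$ equals $w^T w$, so $Q$ never has to be formed explicitly. The following is a minimal sketch of that identity on toy data. The multiclass feature map `joint_feature`, the data, and the dual values are hypothetical and chosen only for illustration; they are not the features used in the experiments.

```python
def joint_feature(x, y, num_labels):
    """Hypothetical Phi(x, y): copy x into the block that belongs to label y."""
    vec = [0.0] * (len(x) * num_labels)
    for k, value in enumerate(x):
        vec[y * len(x) + k] = value
    return vec

def phi(x, y_gold, y, num_labels):
    """phi(y, y_i, x_i) = Phi(x_i, y_i) - Phi(x_i, y)."""
    gold = joint_feature(x, y_gold, num_labels)
    other = joint_feature(x, y, num_labels)
    return [g - o for g, o in zip(gold, other)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weights_from_dual(alpha, data, num_labels):
    """w = sum over dual variables (i, y) of alpha[(i, y)] * phi(y, y_i, x_i)."""
    w = [0.0] * (len(data[0][0]) * num_labels)
    for (i, y), a in alpha.items():
        x, y_gold = data[i]
        for k, value in enumerate(phi(x, y_gold, y, num_labels)):
            w[k] += a * value
    return w

data = [([1.0, 2.0], 0), ([0.5, -1.0], 2)]        # toy (x_i, y_i) pairs
alpha = {(0, 1): 0.3, (0, 2): 0.1, (1, 0): 0.2}   # sparse dual vector (a working set)
w = weights_from_dual(alpha, data, num_labels=3)

# alpha^T Q alpha computed entry by entry from the definition of Q in eq. (2).
quad = sum(
    alpha[p] * alpha[q]
    * dot(phi(data[p[0]][0], data[p[0]][1], p[1], 3),
          phi(data[q[0]][0], data[q[0]][1], q[1], 3))
    for p in alpha for q in alpha
)
```

Here `quad` and `dot(w, w)` agree up to rounding, which is why a solver can keep only $w$ and the non-zero dual variables in memory.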

The key challenge of solving eq. (2) is that for most applications, the size of $\mathcal{Y}_i$, and
thus the dimension of $\alpha$, is exponentially large (with respect to the length of $x_i$), so
optimizing over all variables is unrealistic. Efficient dual methods [?] maintain a small working
set of dual variables to be optimized, such that the remaining variables are fixed at zero. These
methods then iteratively enlarge the working set until the problem is well-optimized.⁴ The working
set is selected using the sub-gradient of (1) with respect to the current iterate. Specifically, for
each training instance $x_i$, we add the dual variable $\alpha_{i,\hat{y}}$ corresponding to the
structure $\hat{y}$ into the working set, where $\hat{y}$ is the structure whose constraint in (1)
is most violated:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}_i} \ \Delta(y_i, y) - w^T \phi(y, y_i, x_i). \tag{3}$$

Once $\alpha$ is updated, we update $w$ accordingly. We call the step of computing eq. (3) “inference”,
and call the part of optimizing eq. (2) over a fixed working set “learning”. When training SSVM
in a distributed manner, the learning step involves communication across machines. Therefore, both
the inference and the learning steps are expensive. In the next section, we propose an algorithm
that requires fewer rounds of both.
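To make the inference step concrete, the following sketch enlarges a working set by brute-force loss-augmented inference on a two-token tagging example. It assumes the most-violated-constraint convention, i.e. it maximizes $\Delta(y_i, y) - w^T \phi(y, y_i, x_i)$. The tag set, the emission and transition features, and the Hamming loss are hypothetical choices for illustration, and enumeration is only feasible for toy inputs; a real tagger or parser would use a task-specific search instead.

```python
from collections import Counter
from itertools import product

TAGS = ("N", "V")

def joint_feature(x, y):
    """Hypothetical Phi(x, y): emission and transition counts as a sparse vector."""
    feats = Counter()
    for token, tag in zip(x, y):
        feats[("emit", token, tag)] += 1
    for prev, cur in zip(y, y[1:]):
        feats[("trans", prev, cur)] += 1
    return feats

def phi(x, y_gold, y):
    """phi(y, y_i, x_i) = Phi(x_i, y_i) - Phi(x_i, y)."""
    diff = Counter(joint_feature(x, y_gold))
    diff.subtract(joint_feature(x, y))
    return diff

def score(w, feats):
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def hamming(y_gold, y):
    return sum(a != b for a, b in zip(y_gold, y))

def most_violated(w, x, y_gold):
    """Brute-force loss-augmented inference over all tag sequences."""
    candidates = product(TAGS, repeat=len(x))
    return max(candidates,
               key=lambda y: hamming(y_gold, y) - score(w, phi(x, y_gold, y)))

x, y_gold = ("dogs", "bark"), ("N", "V")
working_set = set()

w = {}                                    # w = 0: the loss term alone decides
y_hat = most_violated(w, x, y_gold)
working_set.add((0, y_hat))               # alpha_{0, y_hat} enters the working set

# With weights that already fit the gold tags, no constraint is violated and
# inference returns the gold structure itself.
w_fit = {("emit", "dogs", "N"): 5.0, ("emit", "bark", "V"): 5.0}
y_fit = most_violated(w_fit, x, y_gold)
```

With `w = {}` the search returns the sequence farthest from the gold tags, which is the structure a solver would add first; once the model separates the gold structure by a margin larger than the loss, the working set stops growing for that instance.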

2 http://alexander-schwing.de

3 The dual form of L1-loss SSVM has an additional linear constraint, which can be viewed as a polyhedron.
Thus the algorithm is still applicable and the convergence rate analysis technique is still valid.

4 This approach is related to applying cutting-plane methods to solve the primal problem (1).
Algorithm 1: A box-constrained quadratic optimization algorithm for solving (2)

1. $w \leftarrow 0$, $\alpha \leftarrow 0$.
2. For $t = 0, 1, \dots$ (outer iteration)
   2.1. Use the current $w$ to solve (4) to get $d$ distributedly in $K$ machines.
   2.2. Use allreduce to obtain $\Delta w$ in eq. (5).
   2.3. Compute $\eta$ by eq. (6) with another $O(1)$ communication.
   2.4. $\alpha \leftarrow \alpha + \eta d$; $w \leftarrow w + \eta \Delta w$.


3 Distributed Box-Constrained Quadratic Optimization for SSVM 

We split the training data into $K$ disjoint parts and store them on $K$ machines. Eq. (2) is a
box-constrained quadratic optimization problem; therefore, we apply the framework in [?]. At each
iteration, given the current $\alpha$ and a symmetric positive definite $H$, we solve

$$d = \arg\min_{d:\, \alpha + d \ge 0} \ g_H(d) \equiv \nabla f(\alpha)^T d + \frac{1}{2} d^T H d. \tag{4}$$

We then conduct a line search to decide a suitable step size $\eta$ and update
$\alpha \leftarrow \alpha + \eta d$. The detailed description is in Algorithm 1. Here, we consider
$H = \theta \bar{Q} + \frac{A}{2C} + \lambda I$, where $\lambda > 0$ is a small constant to ensure
$H \succ 0$, $\theta > 0$ can be tuned to decide how conservative the updates are, and

$$\bar{Q}_{(i, y_1), (j, y_2)} = \begin{cases} 0 & \text{if } i, j \text{ are not in the same partition}, \\ \phi(y_1, y_i, x_i)^T \phi(y_2, y_j, x_j) & \text{otherwise}. \end{cases}$$
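As an illustration of this structure, the sketch below builds $H$ explicitly for four dual variables stored on two machines. It assumes $H = \theta \bar{Q} + A/(2C) + \lambda I$ with $A$ contributing an all-ones block for the dual variables of each instance; the vectors, the assignment of instances to machines, and the parameter values are hypothetical. An actual solver would never materialize $H$; the dense matrix is used here only to make the zero pattern visible.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_H(phis, instance, machine_of, theta, C, lam):
    """H = theta * Q_bar + A / (2C) + lam * I for a small, explicit working set.

    phis[p]       : phi vector of dual variable p
    instance[p]   : index i of the training instance that p belongs to
    machine_of[i] : machine that stores instance i
    """
    m = len(phis)
    H = [[0.0] * m for _ in range(m)]
    for p in range(m):
        for q in range(m):
            if machine_of[instance[p]] == machine_of[instance[q]]:
                H[p][q] += theta * dot(phis[p], phis[q])   # Q_bar: same partition only
            if instance[p] == instance[q]:
                H[p][q] += 1.0 / (2.0 * C)                 # A: all-ones block per instance
        H[p][p] += lam                                     # lam * I keeps H positive definite
    return H

phis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
instance = [0, 0, 1, 2]            # dual variables 0 and 1 share instance 0
machine_of = {0: 0, 1: 0, 2: 1}    # instances 0 and 1 on machine 0, instance 2 on machine 1
H = build_H(phis, instance, machine_of, theta=2.0, C=0.5, lam=0.01)
cross = [H[p][3] for p in range(3)]   # entries that couple machine 0 with machine 1
```

All entries in `cross` are zero even though the corresponding vectors are not orthogonal, so the quadratic model separates into one independent block per machine. Since the dual variables of one instance always live on the same machine, the $A$ term never couples two machines either.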

The choice of $H$ is based on two factors: 1) to converge fast, $H$ should be an approximation of the
real Hessian; 2) to solve eq. (4) without incurring communication cost across different machines, $H$
should be decomposable into sub-matrices, where each sub-matrix uses information from data stored
on one machine. Our design of $H$ enables eq. (4) to be split into $K$ sub-problems and solved
locally. Each sub-problem can be rewritten as an SSVM dual problem. Thus, one can adopt any
single-machine SSVM solver (e.g., [?]) to solve it. After eq. (4) is solved, we compute

$$\Delta w = \sum_{i=1}^{l} \sum_{y \in \mathcal{Y}_i} d_{i,y}\, \phi(y, y_i, x_i) \tag{5}$$

by an allreduce operation that communicates information between machines. This information also
synchronizes the model for conducting inference to enlarge the working set. Using $\Delta w$, an exact
line search for deciding the optimal step size $\eta^*$ can be conducted:

$$\frac{\partial f(\alpha + \eta d)}{\partial \eta} = 0 \ \Rightarrow\ \eta^* = \frac{-\nabla f(\alpha)^T d}{d^T (Q + A/2C)\, d} = \frac{-\left(w^T \Delta w + \alpha^T (A/2C)\, d - v^T d\right)}{\Delta w^T \Delta w + d^T (A/2C)\, d}.$$

To ensure feasibility, we take the final step size $\eta$ to be

$$\eta = \min\left(\max\{\eta' \mid \alpha + \eta' d \ge 0\},\ \eta^*\right). \tag{6}$$
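The step-size rule can be sketched as a small function. It assumes that $A$ sums the dual variables of each instance, so that $\alpha^T A d = \sum_i (\sum_y \alpha_{i,y})(\sum_y d_{i,y})$, and that $d \ne 0$; the argument names and the numbers in the two calls are hypothetical and only exercise the formula.

```python
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def instance_sums(vec, instance):
    """Sum the entries of vec that belong to the same training instance."""
    sums = defaultdict(float)
    for value, i in zip(vec, instance):
        sums[i] += value
    return sums

def step_size(alpha, d, w, delta_w, v, instance, C):
    """Exact line search along d, clipped so that alpha + eta * d stays non-negative."""
    sa, sd = instance_sums(alpha, instance), instance_sums(d, instance)
    a_A_d = sum(sa[i] * sd[i] for i in sd)          # alpha^T A d
    d_A_d = sum(sd[i] ** 2 for i in sd)             # d^T A d
    grad_dot_d = dot(w, delta_w) + a_A_d / (2 * C) - dot(v, d)
    curvature = dot(delta_w, delta_w) + d_A_d / (2 * C)
    eta_star = -grad_dot_d / curvature
    # Largest step that keeps every coordinate with a negative direction at or above zero.
    eta_max = min((-a / di for a, di in zip(alpha, d) if di < 0), default=float("inf"))
    return min(eta_max, eta_star)

# No coordinate moves toward the boundary: the unconstrained optimum 2/3 is used.
eta_free = step_size(alpha=[0.0, 0.0], d=[1.0, 1.0], w=[0.0, 0.0],
                     delta_w=[1.0, 0.0], v=[1.0, 1.0], instance=[0, 1], C=0.5)

# The first coordinate would become negative beyond eta = 0.1, so the step is clipped.
eta_clipped = step_size(alpha=[0.1, 0.0], d=[-1.0, 1.0], w=[0.0, 0.0],
                        delta_w=[1.0, 0.0], v=[0.0, 2.0], instance=[0, 1], C=0.5)
```

Every quantity in the function is either the shared vector $\Delta w$ or a scalar that can be summed across machines, which is consistent with the $O(1)$ extra communication stated in Algorithm 1.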


Following the analysis in [?], we can show the following convergence result for Algorithm 1.

Theorem 1. Algorithm 1 has global linear convergence when the exact solution of (4) is obtained
at each iteration and $H \succ 0$.

In practice, obtaining the exact solution of (4) is time-consuming. We show that global linear
convergence still holds when (4) is solved approximately.

Corollary 1. Let $d^*$ be the optimal solution of (4). If for some constant $\gamma \in [0, 1)$ and for
all $t$, the update direction $d$ satisfies $\gamma |g_H(d^*)| \le |g_H(d)|$ with $H \succ 0$, then
Algorithm 1 converges with a global linear rate.

Since $\gamma$ is arbitrary, for any sub-problem solver that strictly decreases the function value, we
can easily obtain a value of $\gamma < 1$.

The communication step in eq. (5) requires machines to communicate a vector of size $O(n)$, where
$n$ is the dimension of $w$. The actual cost of this communication depends on the network setting and
usually grows with $K$. We note that solving (4) approximately results in more iterations and thus
more rounds of communication, but requires fewer inference calls. This is therefore a trade-off
between communication and inference. For many applications, inference is much more expensive than
communication, so the balance between these two factors is worth studying empirically.
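The outer loop of Algorithm 1 can be simulated in a single process to see how the pieces fit together. The sketch below is an illustration under several assumptions, not the implementation used in the experiments: the output space is a small multiclass label set that is enumerated in full (so no inference step is needed), the feature map and the 0/1 loss are hypothetical, each local block of eq. (4) is solved approximately by a few coordinate descent sweeps, $H$ is taken as $\theta \bar{Q} + A/(2C) + \lambda I$, and a plain sum over the per-machine vectors stands in for allreduce.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x, y_gold, y, num_labels):
    """phi(y, y_i, x_i) = Phi(x_i, y_i) - Phi(x_i, y) for a hypothetical multiclass
    feature map Phi that copies x into the block of the given label."""
    vec = [0.0] * (len(x) * num_labels)
    for k, value in enumerate(x):
        vec[y_gold * len(x) + k] += value
        vec[y * len(x) + k] -= value
    return vec

def instance_sums(values):
    """Sum a {(i, y): value} dict over y for every instance i."""
    sums = {}
    for (i, _), value in values.items():
        sums[i] = sums.get(i, 0.0) + value
    return sums

def local_direction(local_vars, alpha, w, C, theta, lam, sweeps=3):
    """Approximately solve one machine's block of eq. (4) by coordinate descent."""
    d = {p: 0.0 for p in local_vars}
    u = [0.0] * len(w)                       # u = sum_q d_q * phi_q over local variables
    sum_alpha = instance_sums({p: alpha[p] for p in local_vars})
    sum_d = {i: 0.0 for i in sum_alpha}
    for _ in range(sweeps):
        for p, (f, loss) in local_vars.items():
            i = p[0]
            grad = dot(w, f) + sum_alpha[i] / (2 * C) - loss             # (grad f)_p
            grad += theta * dot(u, f) + sum_d[i] / (2 * C) + lam * d[p]  # plus (H d)_p
            h_pp = theta * dot(f, f) + 1 / (2 * C) + lam
            new = max(-alpha[p], d[p] - grad / h_pp)                     # keeps alpha + d >= 0
            step = new - d[p]
            d[p] = new
            sum_d[i] += step
            u = [a + step * b for a, b in zip(u, f)]
    return d, u                              # u is this machine's share of Delta w

def objective(alpha, w, losses, C):
    """f(alpha) of eq. (2), evaluated through w instead of Q."""
    sums = instance_sums(alpha)
    return (0.5 * dot(w, w) + sum(s * s for s in sums.values()) / (4 * C)
            - sum(alpha[p] * losses[p] for p in alpha))

def train(parts, num_labels, C=0.1, lam=1e-3, outer=20):
    theta = len(parts)                       # theta = K
    machines = [{(i, y): (phi(x, y_gold, y, num_labels), 1.0)   # 0/1 loss as Delta
                 for i, x, y_gold in part for y in range(num_labels) if y != y_gold}
                for part in parts]
    losses = {p: loss for m in machines for p, (_, loss) in m.items()}
    alpha = {p: 0.0 for p in losses}
    w = [0.0] * (len(parts[0][0][1]) * num_labels)
    history = []
    for _ in range(outer):
        # Step 2.1: every machine solves its own block (sequential here, parallel in practice).
        results = [local_direction(m, alpha, w, C, theta, lam) for m in machines]
        d = {p: value for d_k, _ in results for p, value in d_k.items()}
        # Step 2.2: summing the local vectors stands in for allreduce.
        delta_w = [sum(column) for column in zip(*(u for _, u in results))]
        # Step 2.3: exact line search from a few scalars, then clip for feasibility.
        sum_alpha, sum_d = instance_sums(alpha), instance_sums(d)
        grad_d = (dot(w, delta_w) + sum(sum_alpha[i] * sum_d[i] for i in sum_d) / (2 * C)
                  - sum(losses[p] * d[p] for p in d))
        curvature = dot(delta_w, delta_w) + sum(s * s for s in sum_d.values()) / (2 * C)
        if curvature == 0.0:                 # no curvature along d (e.g. d = 0): stop
            break
        eta_max = min((-alpha[p] / d[p] for p in d if d[p] < 0), default=float("inf"))
        eta = min(eta_max, -grad_d / curvature)
        # Step 2.4: update alpha and w together; max() only guards against rounding.
        alpha = {p: max(0.0, alpha[p] + eta * d[p]) for p in alpha}
        w = [a + eta * b for a, b in zip(w, delta_w)]
        history.append(objective(alpha, w, losses, C))
    return w, alpha, history

parts = [[(0, [1.0, 0.0], 0), (1, [0.9, 0.2], 0)],      # machine 0: (i, x_i, y_i)
         [(2, [0.0, 1.0], 1), (3, [0.1, 0.8], 1)]]      # machine 1
w, alpha, history = train(parts, num_labels=2)
```

Changing `sweeps` in `local_direction` is one way to explore the trade-off described above: fewer sweeps make each outer iteration cheaper but tend to require more outer iterations, and therefore more rounds of the summation that stands in for allreduce.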
Figure 1: 1a and 1b: Comparison between different algorithms using eight nodes. 1c and 1d:
Performance of BQO-Struct using different numbers of machines. Training time is in log scale.
Panel titles: (a) POS, (d) DP.


Model Consistency. Unlike binary classification, when learning a structured model, features are
usually generated on-the-fly because the feature set depends on the structures the solver has seen
so far. If each machine maintains its own feature mapping, the feature indices will be inconsistent
across machines. One potential solution is to synchronize the feature mappings at each round.
However, this approach incurs a huge communication overhead. To tackle this issue, we adapt the
feature hashing strategy in [?]. We map the features into integer values in $[0, 2^d)$,
$d \in \mathbb{N}$, by a unique hashing function and use them as new feature indices, such that the
size of the weight vector is at most $2^d$. The input to this hashing function can be any object,
such as an integer or a string. This strategy has been used in distributed environments [?] for
dimension reduction and fast look-up. Here, as argued above, this technique is crucial and
efficient for distributed structured learning.
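A minimal sketch of such a hashing function is given below. The paper does not specify which hash is used, so MD5 over the textual representation of the feature is an assumption made here for illustration, as are the function names and the default $d = 20$. Python's built-in `hash()` is deliberately avoided for strings because it is randomized per process by default and would therefore give different indices on different machines.

```python
import hashlib

def feature_index(feature, d=20):
    """Map a feature (string, integer, or tuple of these) to an index in [0, 2**d)."""
    digest = hashlib.md5(repr(feature).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << d)

def hashed_vector(features, d=20):
    """Turn {feature: value} into {index: value}; colliding features share one weight."""
    vec = {}
    for feature, value in features.items():
        idx = feature_index(feature, d)
        vec[idx] = vec.get(idx, 0.0) + value
    return vec

# A feature first seen on any machine gets the same index everywhere,
# without exchanging a feature dictionary.
features = {("emit", "dogs", "N"): 1.0, ("trans", "N", "V"): 1.0, 42: 2.0}
vec = hashed_vector(features, d=20)
```

A smaller $d$ shrinks the weight vector and the allreduce payload but increases the chance that two features collide and share a weight.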


4 Experiments 

We perform experiments on part-of-speech tagging (POS) and dependency parsing (DP). For both
tasks, we use the Wall Street Journal portion of the Penn Treebank [?] with the standard split for
training (sections 02-21) and test (section 23). For both tasks, we set $C = 0.1$ for SSVM and
compare the following algorithms using eight nodes in a local cluster.

1. BQO-STRUCT: the algorithm we proposed in Section 3. We set $\theta$ to be $K$.

2. ADMM-STRUCT: the alternating direction method of multipliers [?].

3. Distributed Perceptron: a parallel structured Perceptron algorithm described in [?].

4. Simple average: Each machine trains a separate model using the local data. The final model 
is obtained by averaging all local models. 

The sub-problems in ADMM-STRUCT and BQO-Struct are solved by the dual coordinate descent
solver proposed in [?], which is shown to be empirically faster than other existing methods.
To have a fair comparison, we use the same setting for solving sub-problems when possible.

Because different methods solve different objectives, we compare the test performance along
training time. Figure 1 shows the results. BQO-Struct performs the best in both tasks, confirming
its fast theoretical convergence rate. We further investigate the speedup of BQO-Struct in Figures
1c and 1d.⁵ For the time-consuming task DP, the speedup is significant because a large portion of
the training time is spent on inference. Parallelizing this part can achieve nearly linear speedup.
For POS, in contrast, training on a single machine is already fast, so using multiple machines does
not improve the training time much.

Overall, this work addresses the challenge of training structured SVM problems in a distributed
setting and proposes an algorithm with a fast convergence rate and good empirical performance. We
hope this work will inspire more applications of structured learning with large volumes of training
data to improve the performance on structured learning tasks.

This research was supported by the Multimodal Information Access & Synthesis Center at UIUC, part of
CCICADA, a DHS Science and Technology Center of Excellence, and by DARPA under agreement number
FA8750-13-2-0008. The U.S. Government is authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions
contained herein are those of the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

5 This also serves as a comparison between our distributed algorithm and the state-of-the-art
single-machine SSVM solver.